@schase-es
Contributor

@schase-es schase-es commented Jun 18, 2025

TransportStartDatafeedAction previously validated the remote cluster names in a datafeed's index patterns before checking whether the local node had the remote_cluster_client role. Because that role is what enables retrieval of the remote cluster names, validation on a node without the role always failed with a no-such-cluster exception, which hid the real cause. This change moves the remote_cluster_client check ahead of cluster name validation and adds a test.

Closes ES-11841
Closes: #121149
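
For readers who want the reordering spelled out, here is a minimal sketch of the idea. It is not the actual TransportStartDatafeedAction code: the class name, method names, exception classes, and parameters are all hypothetical stand-ins, and treating any pattern with a `:` as a remote pattern is a simplifying assumption. The only things taken from this PR are the order of the two checks and the wording of the two error messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the check ordering described in this PR.
class DatafeedStartCheckSketch {

    // Stand-in for the status_exception that is returned as a 400.
    static class RemoteClientDisabledException extends RuntimeException {
        RemoteClientDisabledException(String message) {
            super(message);
        }
    }

    // Stand-in for no_such_remote_cluster_exception, returned as a 404.
    static class NoSuchRemoteClusterException extends RuntimeException {
        NoSuchRemoteClusterException(String cluster) {
            super("no such remote cluster: [" + cluster + "]");
        }
    }

    // Assumption: a pattern of the form "alias:index" refers to a remote cluster.
    static List<String> remotePatterns(List<String> indices) {
        List<String> remote = new ArrayList<>();
        for (String index : indices) {
            if (index.indexOf(':') > 0) {
                remote.add(index);
            }
        }
        return remote;
    }

    static void checkRemoteIndices(
        String datafeedId,
        List<String> indices,
        String nodeName,
        boolean hasRemoteClusterClientRole,
        Set<String> configuredRemoteClusters
    ) {
        List<String> remote = remotePatterns(indices);
        if (remote.isEmpty()) {
            return; // nothing remote to validate
        }
        // 1. Role check first: without the role, the node cannot see any remote
        //    cluster names, so the name validation below could only ever fail.
        if (hasRemoteClusterClientRole == false) {
            throw new RemoteClientDisabledException(
                "Datafeed [" + datafeedId + "] is configured with a remote index pattern(s) " + remote
                    + " but the current node [" + nodeName + "] is not allowed to connect to remote clusters."
                    + " Please enable node.remote_cluster_client for all machine learning nodes"
                    + " and master-eligible nodes."
            );
        }
        // 2. Only then validate the cluster aliases against the known remotes.
        for (String pattern : remote) {
            String alias = pattern.substring(0, pattern.indexOf(':'));
            if (configuredRemoteClusters.contains(alias) == false) {
                throw new NoSuchRemoteClusterException(alias);
            }
        }
    }

    public static void main(String[] args) {
        try {
            checkRemoteIndices("test_datafeed", List.of("remote_cluster:remote_index"), "local_cluster-0", false, Set.of());
        } catch (RemoteClientDisabledException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the two steps swapped, the same call would reach the alias loop with an empty set of known remotes and report a missing cluster instead of the missing role.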

@schase-es schase-es added the >non-issue, :Distributed Coordination/Cluster Coordination, backport, auto-backport, v8.19.0, and v9.1.0 labels Jun 18, 2025
@schase-es
Contributor Author

Including David Kyle, as the last person who edited these, as an FYI. This area seems to have had limited distributed coordination involvement in the past.

Contributor

@nicktindall nicktindall left a comment

Some minor nits, but this LGTM. Please wait until we've got approval from someone in ML before merging though, because I'm not 100% familiar with this code.

(A specific question for someone more familiar with the code: is this the best place to do the check? Node roles are immutable, so we could do the check when the job is created. But the master could change between creation and start to one that does have the role, and then everything would be fine. So perhaps start is the right place?)

schase-es and others added 4 commits June 18, 2025 17:34
- changed doc url to link github issue
- added return after continuation send -- oops!
- use junit methods for testing message/exception matching
- moved the test to the xpack ml/qa folder
- added a gradle build file that appears to work
- removed excess override settings from RestTest
Member

@davidkyle davidkyle left a comment

LGTM

@schase-es schase-es merged commit cb451da into elastic:main Jun 20, 2025
27 checks passed
schase-es added a commit to schase-es/elasticsearch that referenced this pull request Jun 20, 2025
…elastic#129601)
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.19

@schase-es
Contributor Author

schase-es commented Jun 20, 2025

Great -- everything is merged. Thanks Nick as usual, and David especially for reviewing something from another team -- and helping me get started with Gradle!

elasticsearchmachine pushed a commit that referenced this pull request Jun 21, 2025
…#129601) (#129802)
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
…elastic#129601)
julian-elastic pushed a commit to julian-elastic/elasticsearch that referenced this pull request Jun 24, 2025
…elastic#129601)
@schase-es
Contributor Author

FYI @kderusso and @valeriy42:

The error is now a 400 response with an explanatory message:
method [POST], host [http://[::1]:63535], URI [_ml/datafeeds/test_datafeed/_start], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"status_exception","reason":"Datafeed [test_datafeed] is configured with a remote index pattern(s) [remote_cluster:remote_index] but the current node [local_cluster-0] is not allowed to connect to remote clusters. Please enable node.remote_cluster_client for all machine learning nodes and master-eligible nodes."}],"type":"status_exception","reason":"Datafeed [test_datafeed] is configured with a remote index pattern(s) [remote_cluster:remote_index] but the current node [local_cluster-0] is not allowed to connect to remote clusters. Please enable node.remote_cluster_client for all machine learning nodes and master-eligible nodes."},"status":400}

The old error, which I reproduced in testing and which is now gone, was a 404 response:
method [POST], host [http://[::1]:63626], URI [_ml/datafeeds/test_datafeed/_start], status line [HTTP/1.1 404 Not Found] {"error":{"root_cause":[{"type":"no_such_remote_cluster_exception","reason":"no such remote cluster: [remote_cluster]"}],"type":"no_such_remote_cluster_exception","reason":"no such remote cluster: [remote_cluster]"},"status":404}

I did not write it to issue an IllegalArgumentException as mentioned earlier. While there are several code paths that do this, this API already had a check for this and an error message, but it was in the wrong order relative to cluster name validation.
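
If it helps anyone telling the two responses above apart in a client or a test, here is a rough sketch. The class and method names are hypothetical, and pulling fields out with regular expressions is a shortcut for illustration only; real code would more likely use a JSON parser and the project's own test utilities. It relies only on the `status`, `type`, and `reason` fields visible in the two responses quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: classifies a start-datafeed error body like the ones quoted above.
class StartDatafeedErrorSketch {

    // The top-level "status" field closes the response body.
    private static final Pattern STATUS = Pattern.compile("\"status\":(\\d+)\\}\\s*$");
    // The first "type" field in the body belongs to the first root cause.
    private static final Pattern TYPE = Pattern.compile("\"type\":\"([a-z_]+)\"");

    static int status(String body) {
        Matcher m = STATUS.matcher(body);
        if (m.find() == false) {
            throw new IllegalArgumentException("no status field found");
        }
        return Integer.parseInt(m.group(1));
    }

    static String firstErrorType(String body) {
        Matcher m = TYPE.matcher(body);
        if (m.find() == false) {
            throw new IllegalArgumentException("no type field found");
        }
        return m.group(1);
    }

    // True for the new 400 response that names the missing node setting.
    static boolean isMissingRemoteClientRole(String body) {
        return status(body) == 400
            && "status_exception".equals(firstErrorType(body))
            && body.contains("node.remote_cluster_client");
    }
}
```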

mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
…elastic#129601)

Development

Successfully merging this pull request may close these issues.

Improve exception handling when master node is missing remote_cluster_client role

4 participants